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Abstract 

Gaussian process (GP) models form a core part of probabilistic machine learning. Con¬ 
siderable research effort has been made into attacking three issues with GP models: how 
to compute efficiently when the number of data is large; how to approximate the posterior 
when the likelihood is not Gaussian and how to estimate covariance function parameter 
posteriors. This paper simultaneously addresses these, using a variational approximation 
to the posterior which is sparse in support of the function but otherwise free-form. The 
result is a Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian approx¬ 
imation over the function values and covariance parameters simultaneously, with efficient 
computations based on inducing-point sparse GPs. Code to replicate each experiment in 
this paper will be available shortly. 


1. Introduction 

Gaussian process models are attractive for machine learning because of their flexible non- 
parametric nature. By combining a GP prior with different likelihoods, a multitude of 
machine learning tasks can be tackled in a probabilistic fashion [1]. There are three things 
to consider when using a GP model: approximation of the posterior function (especially if 
the likelihood is non-Gaussian), computation, storage and inversion of the covariance ma¬ 
trix, which scales poorly in the number of data; and estimation (or marginalization) of the 
covariance function parameters. A multitude of approximation schemes have been proposed 
for efficient computation when the number of data is large. Early strategies were based on 
retaining a sub-set of the data [2, 3]. Snelson and Ghahramani [4] introduced an inducing 
point approach, where the model is augmented with additional variables, and Titsias [5] 
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Table 1: Existing variational approaches 


Reference 

p(y 1 f ) 

Sparse 

Posterior 

Hyperparam. 

Williams & Barber[22] [also 15, 18] 

probit/logit 

X 

Gaussian (assumed) 

point estimate 

Titsias [5] 

Gaussian 

/ 

Gaussian (optimal) 

point estimate 

Chai [19] 

softmax 

/ 

Gaussian (assumed) 

point estimate 

Nguyen and Bonilla [1] 

any factorized 

X 

Mixture of Gaussians 

point estimate 

Hensman et al. [21] 

probit 

/ 

Gaussian (assumed) 

point estimate 

This work 

any factorized 

/ 

free-form 

free-form 


used these ideas in a variational approach. Other authors have introduced approximations 
based on the spectrum of the GP [6, 7], or which exploit specific structures within the 
covariance matrix [8, 9], or by making unbiased stochastic estimates of key computations 
[10]. In this work, we extend the variational inducing point framework, which we prefer for 
general applicability (no specific requirements are made of the data or covariance function), 
and because the variational inducing point approach can be shown to minimize the KL 
divergence to the posterior process [11]. 

To approximate the posterior function and covariance parameters, Markov chain Monte- 
Carlo (MCMC) approaches provide asymptotically exact approximations. Murray and 
Adams [12] and Filippone et al. [13] examine schemes which iteratively sample the func¬ 
tion values and covariance parameters. Such sampling schemes require computation and 
inversion of the full covariance matrix at each iteration, making them unsuitable for large 
problems. Computation may be reduced somewhat by considering variational methods, 
approximating the posterior using some fixed family of distributions [14, 15, 16, 17, 1, 18], 
though many covariance matrix inversions are generally required. Recent works [19, 20, 21] 
have proposed inducing point schemes which can reduce the computation required sub¬ 
stantially, though the posterior is assumed Gaussian and the covariance parameters are 
estimated by (approximate) maximum likelihood. Table 1 places our work in the context 
of existing variational methods for GPs. 

This paper presents a general inference scheme, with the only concession to approx¬ 
imation being the variational inducing point assumption. Non-Gaussian posteriors are 
permitted through MCMC, with the computational benefits of the inducing point frame¬ 
work. The scheme jointly samples the inducing-point representation of the function with 
the covariance function parameters; with sufficient inducing points our method approaches 
full Bayesian inference over GP values and the covariance parameters. We show empirically 
that the number of required inducing points is substantially smaller than the dataset size 
for several real problems. 

2. Stochastic process posteriors 

The model is set up as follows. We are presented with some data inputs X = {x n }n=i 
and responses y = {y n }^ =1 . A latent function is assumed drawn from a GP with zero 
mean and covariance function fc(x, x') with (hyper-) parameters 0. Consistency of the 
GP means that only those points with data are considered: the latent vector f represents 
the values of the function at the observed points f = {/(x n )}^ =1 , and has conditional 
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distribution p(f | X, 6) — A/*(f | 0, Kjj), where K ff is a matrix composed of evaluating 
the covariance function at all pairs of points in X. The data likelihood depends on the 
latent function values: p(y | f). To make a prediction for latent function value test points 
f* = {/(x*)} x *ex *7 the posterior function values and parameters are integrated: 

p{ f* I y) = Jj p( f* I f, 0)p( f, e I y) do df. (1) 


In order to make use of the computational savings offered by the variational inducing point 
framework [5], we introduce additional input points to the function Z and collect the re¬ 
sponses of the function at that point into the vector u = {u m = /( z m )}^ =1 . With some 
variational posterior q{ u, 0 ), new points are predicted similarly to the exact solution 

9(f*) = J J P( f * I u > *Mu, 0) d6> du. (2) 

This makes clear that the approximation is a stochastic process in the same fashion as the 
true posterior: the length of the predictions vector f* is potentially unbounded, covering 
the whole domain. 

To obtain a variational objective, first consider the support of u under the true posterior, 
and for f under the approximation. In the above, these points are subsumed into the 
prediction vector f*: from here we shall be more explicit, letting f be the points of the 
process at X, u be the points of the process at Z and f* be a large vector containing all 
other points of interest 1 . All of the free parameters of the model are then f*,f, u, 0, and 
using a variational framework, we aim to minimize the Kullback-Leibler divergence between 
the approximate and true posteriors: 


p(f* I u > f, Q)p( u I f, Q)p{ f, 01 y) ~ 
p(f*|u,f,0)p(f \u,G)q(u, 6) 

(3) 

where the conditional distributions for f* have been expanded to make clear that they are 
the same under the true and approximate posteriors, and X, Z and X* have been omitted 
for clarity. Straight-forward identities simplify the expression, 


K, = KL[g(f*,f,u,0)||p(f*,f,u,0|y)] = -IE 

q( f*,f,u,0) 


JC = -E, 


q( f,u,0) 


= —E 


q( f,u,0) 


log 

log 


p(u | f, 0)p(i | 0)p(0)y(y | f)/p(y) 
P(f |u, 0)q( u,6>) 
p{ u I o)p(0)p( y I f) 


q(u,9) 


+ logp(y), 


(4) 


resulting in the variational inducing-point objective investigated by Titsias [5], aside from 
the inclusion of 0. This can be rearranged to give the following informative expression 


/C = KL 


( i ,M|P(u|0)p(0)Ep ( f|u,e ) [logp(y|f)]' 

g ( u > d ) 11 - -77-^ - 


— log C + log p(y ). 


(5) 


1. The vector f* here is considered finite but large enough to contain any point of interest for prediction. 
The infinite case follows Matthews et al. [11], is omitted here for brevity, and results in the same solution. 
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Here C is an intractable constant which normalizes the distribution and is independent of 
q. Minimizing the KL divergence on the right hand side reveals that the optimal variational 
distribution is 

log q( u, 0) = Ep(f | u , 0 ) [log p(y | f )] + logp(u | 0 ) + logp(0) - log C. (6) 

For general likelihoods, since the optimal distribution does not take any particular form, 
we intend to sample from it using MCMC. This is feasible using standard methods since it 
is computable up to a constant, using 0(NM 2 ) computations. To sample effectively, the 
following are proposed. 

Whitening the prior Noting that the problem (6) appears similar to a standard GP for 
u, albeit with an interesting ‘likelihood’, we make use of an ancillary augmentation u = Rv, 
with RR t = K uui v ~ A/*(0,1). This results in the optimal variational distribution 

log<?(v, 0) = E p(f | u=Rv) [log p(y | f)] + logp(v) + logp(0) - logC (7) 

Previously [12, 13] this parameterization has been used with schemes which alternate be¬ 

tween sampling the latent function values (represented by v or u) and the parameters 0. 
Our scheme uses HMC across v and 6 jointly, whose effectiveness is examined throughout 
the experiment section. 

Quadrature The first term in (6) is the expected log-likelihood. In the case of factoriza¬ 
tion across the data-function pairs, this results in N one-dimensional integrals. For Gaussian 
or Poisson likelihood these integrals are tractable, otherwise they can be approximated by 
Gauss-Hermite quadrature. Given the current sample v, the expectations are computed 
w.r.t. p(f n I V, 0) = J\f(n n , 7 n), With: 

/it = A t v; 7 = diag(K// - A T A); A = R -1 K U /; RR T = K uu , ( 8 ) 

where the kernel matrices are computed similarly to Kjj, but over the pairs in 

(X, Z),(Z, Z) respectively. From here, one can compute the expected likelihood and it is 
subsequently straight-forward to compute derivatives in terms of K u j, diag(Kjj) and R. 

Reverse mode differentiation of Cholesky To compute derivatives with respect to 6 
and Z we use reverse-mode differentiation (backpropagation) of the derivative through the 
Cholesky matrix decomposition, transforming dlog g(v, 0)/d R into dlog g(v, 0)/d K nn , and 
then dlog q(v,0)/d0. This is discussed by Smith [23], and results in a 0(M 3 ) operation; 
an efficient Cython implementation is provided in the supplement. 

3. Treatment of inducing point positions & inference strategy 

A natural question is, what strategy should be used to select the inducing points Z? In the 
original inducing point formulation [4], the positions Z were treated as parameters to be 
optimized. One could interpret them as parameters of the approximate prior covariance [24]. 
The variational formulation [5] treats them as parameters of the variational approximation, 
thus protecting from over-fitting as they form part of the variational posterior. In this work, 
since we propose a Bayesian treatment of the model, we question whether it is feasible to 
treat Z in a Bayesian fashion. 
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Since u and Z are auxiliary parameters, the form of their distribution does not affect 
the marginals of the model. The term p(u | Z) has been defined by the consistency with the 
GP in order to preserve the posterior-process interpretation above (i.e. u should be points 
on the GP), but we are free to choose p( Z). Omitting dependence on 6 for clarity, and 
choosing w.l.o.g. q( u, Z) = q( u | Z)q(Z), the bound on the marginal likelihood, similarly to 
(4) is given by 


£ — ®p(f | u,Z)g(u | Z)g(Z) 


log 


p(y I f)p( u I z)p(z) 

q(u\Z)q(Z) 


(9) 


The bound can be maximized w.r.t p( Z) by noting that the term only appears inside a 
(negative) KL divergence: — E g (z)[logg(Z)/p(Z)]. Substituting the optimal p(Z) = q(Z) 
reduces (9) to 


C = E, 


<?(Z) 


E. 


'P(f I u ,Z)q(u | Z) 


log 


p(y I f )p(u I z) 

9 (u|Z) 


( 10 ) 


which can now be optimized w.r.t. q(Z). Since no entropy term appears for q( Z), the bound 
is maximized when the distribution becomes a Dirac’s delta. In summary, since we are free 
to choose a prior for Z which maximizes the amount of information captured by u, the 
optimal distribution becomes p( Z) = q(Z) = S( Z — Z). This formally motivates optimizing 
the inducing points Z. 


Derivatives for Z For completeness we also include the derivative of the free form ob¬ 
jective with respect to the inducing point positions. Substituting the optimal distribution 
q( u, 6) into (4) to give /C and then differentiating we obtain 


dfc 

dZ 


dlogC 

dZ 


= -%v,0) 


d 

0Z Ep ( f l u: 


:Rv) [logp(y I f)] 


( 11 ) 


Since we aim to draw samples from q(v, 0), evaluating this free form inducing point gradient 
using samples seems plausible but challenging. Instead we use the following strategy. 


1 . Fit a Gaussian approximation to the posterior. We follow [21] in fitting a 
Gaussian approximation to the posterior. The positions of the inducing points are initial¬ 
ized using k-means clustering of the data. The values of the latent function are represented 
by a mean vector (initialized randomly) and a lower-triangular matrix L forms the approx¬ 
imate posterior covariance as LL T . For large problems (such as the MNIST experiment), 
stochastic optimization using AdaDelta is used. Otherwise, LBFGS is used. After a few 
hundred iterations with the inducing points positions fixed, they are optimized in free-form 
alongside the variational parameters and covariance function parameters. 


2 . Initialize the model using the approximation. Having found a satisfactory 
approximation, the HMC strategy takes the optimized inducing point positions from the 
Gaussian approximation. The initial value of v is drawn from the Gaussian approximation, 
and the covariance parameters are initialized at the (approximate) MAP value. 

3. Tuning HMC The HMC algorithm has two free parameters to tune, the number of 
leapfrog steps and the step-length. We follow a strategy inspired by Wang et al. [25], where 
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the number of leapfrog steps is drawn randomly from 1 to L max , and Bayesian optimiza¬ 
tion is used to maximize the expected square jump distance (ESJD), penalized by y/L max . 
Rather than allow an adaptive (but convergent) scheme as [25], we run the optimization for 
30 iterations of 30 samples each, and use the best parameters for a long run of HMC. 

4. Run tuned HMC to obtain predictions Having tuned the HMC, it is run for 

several thousand iterations to obtain a good approximation to g(v, 6). The samples are used 
to estimate the integral in equation (2). The following section investigates the effectiveness 
of the proposed sampling scheme. 

4. Experiments 

4.1 Efficient sampling using Hamiltonian Monte Carlo 

This section illustrates the effectiveness of Hamiltonian Monte Carlo in sampling from 
g(v, 6). As already pointed out, the form assumed by the optimal variational distribution 
g(v, 6) in equation (6) resembles the joint distribution in a GP model with a non-Gaussian 
likelihood. 

For a fixed 0, sampling v is relatively straightforward, and this is possible to be done 
efficiently using HMC [13, 26, 27] or Elliptical Slice Sampling [28]. A well tuned HMC 
has been reported to be extremely efficient in sampling the latent variables, and this mo¬ 
tivates our effort into trying to extend this efficiency to the sampling of hyper-parameters 
as well. This is also particularly appealing due to the convenience offered by the proposed 
representation of the model. 

The problem of drawing samples from the posterior distribution over v, 6 has been in¬ 
vestigated in detail in [12, 13]. In these works, it has been advocated to alternate between 
the sampling of v and 6 in a Gibbs sampling fashion and condition the sampling of 6 on a 
suitably chosen transformation of the latent variables. For each likelihood model, we com¬ 
pare efficiency and convergence speed of the proposed HMC sampler with a Gibbs sampler 
where v is sampled using HMC 0 is sampled using the Metropolis-Hastings algorithm. To 
make the comparison fair, we imposed the mass matrix in HMC and the covariance in MH 
to be isotropic, and any parameters of the proposal were tuned using Bayesian optimiza¬ 
tion. Unlike in the proposed HMC sampler, for the Gibbs sampler we did not penalize the 
objective function of the Bayesian optimization for large numbers of leapfrog steps, as in 
this case HMC proposals on the latent variables are computationally cheaper than those on 
the hyper-parameters. We report efficiency in sampling from g(v, 0) using Effective Sam¬ 
ple Size (ESS) and Time Normalized (TN)-ESS. In the supplement we include convergence 
plots based on the Potential Scale Reduction Factor (PSRF) computed based on ten parallel 
chains; in these each chain is initialized from the VB solution and individually tuned using 
Bayesian optimization. 

4.2 Binary Classification 

We first use the image dataset [29] to investigate the benefits of the approach over a Gaussian 
approximation, and to investigate the effect of changing the number of inducing points, as 
well as optimizing the inducing points under the Gaussian approximation. The data are 
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number of inducing points number of inducing points 

Figure 1: Performance of the method on the image dataset, with one lengthscale per di¬ 
mension. Left, box-plots show performance for varying numbers of inducing points and Z 
strategies. Optimizing Z using the Gaussian approximation offers significant improvement 
over the k-means strategy. Right: improvement of the performance of the Gaussian approx¬ 
imation method, with the same inducing points. The method offers consistent performance 
gains when the number of inducing points is larger. The supplement contains a similar 
figure with only a single lengthscale. 


18 dimensional: we investigated the effect of our approximation using both ARD (one 
lengthscale per dimension) and an isotropic RBF kernel. The data were split randomly 
into 1000/1019 train/test sets; the log predictive density over ten random splits is shown in 
Figure 1. 

Following the strategy outlined above, we fitted a Gaussian approximation to the pos¬ 
terior, with Z initialized with k-means. Figure 1 investigates the difference in performance 
when Z is optimized using the Gaussian approximation, compared to just using k-means 
for Z. Whilst our strategy is not guaranteed to find the global optimum, it is clear that it 
improves the performance. 

The second part of Figure 1 shows the performance improvement of our sampling ap¬ 
proach over the Gaussian approximation. We drew 10,000 samples, discarding the first 1000: 
we see a consistent improvement in performance once M is large enough. For small M, The 
Gaussian approximation appears to work very well. The supplement contains a similar Fig¬ 
ure for the case where a single lengthscale is shared: there, the improvement of the MCMC 
method over the Gaussian approximation is smaller but consistent. We speculate that the 
larger gains for ARD are due to posterior uncertainty in the lengthscale parameters, which 
is poorly represented by a point in the Gaussian/MAP approximation. 

The ESS and TN-ESS are comparable between HMC and the Gibbs sampler. In par¬ 
ticular, for 100 inducing points and the RBF covariance, ESS and TN-ESS for HMC are 11 
and 1.0* 10 -3 and for the Gibbs sampler are 53 and 5.1 • 10 -3 . For the ARD covariance, ESS 
and TN-ESS for HMC are 14 and 5.1 • 10 -3 and for the Gibbs sampler are 1.6 and 1.5 • 10 -4 . 
Convergence, however, seems to be faster for HMC, especially for the ARD covariance (see 
the supplement). 
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Figure 2: The posterior of the rates for the coal mining disaster data. Left: posterior 
rates using our variational MCMC method and a Gaussian approximation. Data are shown 
as vertical bars. Right: posterior samples for the covariance function parameters using 
MCMC. The Gaussian approximation estimated the parameters as (12.06, 0.55). 


4.3 Log Gaussian Cox Processes 

We apply our methods to Log Gaussian Cox processes [30]: doubly stochastic models where 
the rate of an inhomogeneous Poisson process is given by a Gaussian process. The main 
difficulty for inference lies in that the likelihood of the GP requires an integral over the 
domain, which is typically intractable. For low dimensional problems, this integral can 
be approximated on a grid; assuming that the GP is constant over the width of the grid 
leads to a factorizing Poisson likelihood for each of the grid points. Whilst some recent 
approaches allow for a grid-free approach [20, 31], these usually require concessions in the 
model, such as an alternative link function, and do not approach full Bayesian inference 
over the covariance function parameters. 

Coal mining disasters On the one-dimensional coal-mining disaster data. We held out 
50% of the data at random, and using a grid of 100 points with 30 evenly spaced inducing 
points Z, fitted both a Gaussian approximation to the posterior process with an (approx¬ 
imate) MAP estimate for the covariance function parameters (variance and lengthscale of 
an RBF kernel). With Gamma priors on the covariance parameters we ran our sampling 
scheme using HMC, drawing 3000 samples. The resulting posterior approximations are 
shown in Figure 2, alongside the true posterior using a sampling scheme similar to ours 
(but without the inducing point approximation). The free-form variational approximation 
matches the true posterior closely, whilst the Gaussian approximation misses important 
detail. The approximate and true posteriors over covariance function parameters are shown 
in the right hand part of Figure 2, there is minimal discrepancy in the distributions. 

Over 10 random splits of the data, the average held-out log-likelihood was —1.229 for the 
Gaussian approximation and —1.225 for the free-form MCMC variant; the average difference 
was 0.003, and the MCMC variant was always better than the Gaussian approximation. We 
attribute this improved performance to marginalization of the covariance function parame¬ 
ters. 

Efficiency of HMC is greater than for the Gibbs sampler; ESS and TN-ESS for HMC 
are 6.7 and 3.1 • 10 -2 and for the Gibbs sampler are 9.7 and 1.9-10 -2 . Also, chains converge 
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Figure 3: Pine sapling data. From left to right: reported locations of pine saplings; posterior 
mean intensity on a 32x32 grid using full MCMC; posterior mean intensity on a 32x32 grid 
(with sparsity using 225 inducing points), posterior mean intensity on a 64x64 grid (using 
225 inducing points). The supplement contains a larger version of this figure. 


within few thousand iterations for both methods, although convergence for HMC is faster 
(see the supplement). 

Pine saplings The advantages of the proposed approximation are prominent as the num¬ 
ber of grid points become higher, an effect emphasized with increasing dimension of the 
domain. We fitted a similar model to the above to the pine sapling data [30]. 

We compared the sampling solution obtained using 225 inducing points on a 32 x 32 grid 
to the gold standard full MCMC run with the same prior and grid size. Figure 3 shows that 
the agreement between the variational sampling and full sampling is very close. However 
the variational method was considerably faster. Using a single core on a desktop computer 
required 3.4 seconds to obtain 1 effective sample for a well tuned variational method whereas 
it took 554 seconds for well tuned full MCMC. This effect becomes even larger as we increase 
the resolution of the grid to 64 x 64, which gives a better approximation to the underlying 
smooth function as can be seen in figure 3. It took 4.7 seconds to obtain one effective sample 
for the variational method, but now gold standard MCMC comparison was computationally 
extremely challenging to run for even a single HMC step. This is because it requires linear 
algebra operations using 0(N 3 ) flops with N = 4096. 


4.4 Multi-class Classification 

To do multi-class classification with Gaussian processes, one latent function is defined for 
each of the classes. The functions are defined a-priori independent, but covary a poste¬ 
riori because of the likelihood. Chai [19] studies a sparse variational approximation to 
the softmax multi-class likelihood restricted to a Gaussian approximation. Here, following 
[32, 33, 34], we use a robust-max likelihood. Given a vector f n containing K latent functions 
evaluated at the point x n , the probability that the label takes the integer value y n is 


P(yn\fn) 


1 - e, if y n = argmax f n 

e/(K — 1), otherwise. 


( 12 ) 


As Girolami and Rogers [32] discuss, the ‘soft’ probit-like behaviour is recovered by adding 
a diagonal ‘nugget’ to the covariance function. In this work, e was fixed to 0.001, though it 
would also be possible to treat this as a parameter for inference. The expected log-likelihood 
is E p ( fn | v? 0 ) [log p(y n | f n )] = plog(e) + (1 -p) log (e/(K - 1)), where p is the probability that 
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Figure 4: A toy multiclass problem. Left: the Gaussian approximation, colored points show 
the simulated data, lines show posterior probability contours at 0.3, 0.95, 0.99. Inducing 
points positions shows as black points. Middle: the free form solution with 10,000 posterior 
samples. The free-form solution is more conservative (the contours are smaller). Right: 
posterior samples for v at the same position but across different latent functions. The 
posterior exhibits strong correlations and edges. 


the labelled function is largest, which is computable using one-dimensional quadrature. An 
efficient Cython implementation is contained in the supplement. 

Toy example To investigate the proposed posterior approximation for the multivariate 
classification case, we turn to the toy data shown in Figure 4. We drew 750 data points from 
three Gaussian distributions. The synthetic data was chosen to include non-linear decision 
boundaries and ambiguous decision areas. Figure 4 shows that there are differences between 
the variational and sampling solutions, with the sampling solution being more conservative 
in general (the contours of 95% confidence are smaller). As one would expect at the decision 
boundary there are strong correlations between the functions which could not be captured 
by the Gaussian approximation we are using. Note the movement of inducing points away 
from k-means and towards the decision boundaries. 

Efficiency of HMC and the Gibbs sampler is comparable. In the RBF case, ESS and 
TN-ESS for HMC are 1.9 and 3.8 • 10 -4 and for the Gibbs sampler are 2.5 and 3.6 • 10 -4 . In 
the ARD case, ESS and TN-ESS for HMC are 1.2 and 2.8 • 10 -3 and for the Gibbs sampler 
are 5.1 and 6.8 • 10 -4 . For both cases, the Gibbs sampler struggles to reach convergence 
even though the average acceptance rates are similar to those recommended for the two 
samplers individually. 

MNIST The MNIST dataset is a well studied benchmark with a defined training/test 
split. We used 500 inducing points, initialized from the training data using k-means. A 
Gaussian approximation was optimized using minibatch-based optimization over the means 
and variances of q( u), as well as the inducing points and covariance function parameters. 
The accuracy on the held-out data was 98.04%, significantly improving on previous ap¬ 
proaches to to classify these digits using GP models. 

For binary classification, Hensman et al. [21] reported that their Gaussian approximation 
resulted in movement of the inducing point positions toward the decision boundary. The 
same effect appears in the multivariate case, as shown in Figure 5, which shows three of 
the 500 inducing points used in the MNIST problem. The three examples were initialized 
close to the many six digits, and after optimization have moved close to other digits (five 
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Figure 5: Left: three k-means centers used to initialize the inducing point positions. Center: 
the positions of the same inducing points after optimization. Right: difference. 


and four). The last example still appears to be a six, but has moved to a more ‘unusual’ 
six shape, supporting the function at another extremity. Similar effects are observed for all 
inducing-point digits. Having optimized the inducing point positions with the approximate 
g(v), and estimate for 0, we used these optimal inducing points to draw samples from v and 
0. This did not result in an increase in accuracy, but did improve the log-density on the test 
set from -0.068 to -0.064. Evaluating the gradients for the sampler took approximately 0.4 
seconds on a desktop machine, and we were easily able to draw 1000 samples. This dataset 
size has generally be viewed as challenging in the GP community and consequently there 
are not many published results to compare with. One recent work [35] reports a 94.05% 
accuracy using variational inference and a GP latent variable model. 

5. Discussion 

We have presented an inference scheme for general GP models. The scheme significantly re¬ 
duces the computational cost whilst approaching exact Bayesian inference, making minimal 
assumptions about the form of the posterior. The improvements in accuracy in compar¬ 
ison with the Gaussian approximation of previous works has been demonstrated, as has 
the quality of the approximation to the hyper-parameter distribution. Our MCMC scheme 
was shown to be effective for several likelihoods, and we note that the automatic tuning 
of the sampling parameters worked well over hundreds of experiments. This paper shows 
that MCMC methods are feasible for inference in large GP problems, addressing the unfair 
sterotype of ‘slow’ MCMC. 
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Supplementary material for: 

MCMC for Variationally Sparse GPs 


5.1 Coal data 

Figure 6 replicates Figure 1, but with a single lengthscale shared across each input. 
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Figure 6: Performance of the method on the image dataset, with a single lengthscale. 


5.2 Convergence plots 

Convergence of the samplers on the Image dataset is reported in fig. 7 and shows the 
evolution of the PSRF for the twenty slowest parameters for HMC and the Gibbs sampler in 
the case of RBF and ARD covariances. The figure shows that HMC consistently converges 
faster than the Gibbs sampler for both covariances, even when the ESS of the slowest 
variable is comparable. 

Fig. 7 shows the convergence analysis on the coal dataset. In this case, HMC converges 
faster than the Gibbs sampler and efficiency is comparable. 

Convergence of the samplers on the toy multi-class dataset is reported in fig. 9. HMC 
converges much faster than the Gibbs sampler even though efficiency measured through 
ESS is comparable. 
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Figure 7: Image dataset - Evolution of the PSRF of the twenty least efficient parameter 
traces for HMC (blue) and the Gibbs sampler (red). Left panel: RBF case - minimum ESS 
and TN-ESS for HMC are 11 and 1.0 • 10 -3 and for the Gibbs sampler are 53 and 5.1 • 10 -3 . 
Right panel: ARD case - minimum ESS and TN-ESS for HMC are 14 and 5.1 • 10 -3 and 
for the Gibbs sampler are 1.6 and 1.5 • 10 -4 . 



Figure 8: Coal dataset - Evolution of the PSRF of the twenty least efficient parameter 
traces for HMC (blue) and the Gibbs sampler (red). Minimum ESS and TN-ESS for HMC 
are 6.7 and 3.1 • 10 -2 and for the Gibbs sampler are 9.7 and 1.9 • 10 -2 . 




iteration 


iteration 


Figure 9: Multiclass dataset - Evolution of the PSRF of the twenty least efficient parameter 
traces for HMC (blue) and the Gibbs sampler (red). Left panel: RBF case - minimum ESS 
and TN-ESS for HMC are 1.9 and 3.8-10 -4 and for the Gibbs sampler are 2.5 and 3.6-10 -4 . 
Right panel: ARD case - minimum ESS and TN-ESS for HMC are 1.2 and 2.8 • 10 -3 and 
for the Gibbs sampler are 5.1 and 6.8 • 10 -4 . 
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5.3 Pine saplings 






12.0 


10.5 


9.0 


7.5 


6.0 


4.5 


3.0 


1.5 


0.0 


Figure 10: A larger version of Figure 3. Top right: gold standard MCMC 32x32 grid. 
Bottom left: Variational MCMC 32x32 grid. Bottom right: Variational MCMC 64x64 grid, 
with 225 inducing points in the non-exact case. 
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